White Wine Dataset Exploration by Dieter Annys

The dataset I’m about to explore was taken from the following reference:

P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.

As quoted from the original information bundled with the dataset:

“In the above reference, two datasets were created, using red and white wine samples. The inputs include objective tests (e.g. PH values) and the output is based on sensory data (median of at least 3 evaluations made by wine experts). Each expert graded the wine quality between 0 (very bad) and 10 (very excellent).”

The following analysis is done on the white wine dataset.

Univariate Plots Section

First, I’ll have a look at what variables are contained in the dataset:

## 'data.frame':    4898 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...
## [1] "4898 complete cases in dataset"

The dataset does not contain any incomplete records.
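A check like the one above can be sketched in a few lines of base R. The data frame here is a tiny stand-in built from the first few values in the listing, not the real file, and `n_complete` is just an illustrative name.

```r
# Minimal sketch: inspect the structure and count rows without missing values.
# `wine` is a small stand-in data frame, not the full 4898-row dataset.
wine <- data.frame(
  X       = 1:4,
  alcohol = c(8.8, 9.5, 10.1, 9.9),
  pH      = c(3.00, 3.30, 3.26, 3.19),
  quality = c(6L, 6L, 6L, 6L)
)

str(wine)                                # variable names and types
n_complete <- sum(complete.cases(wine))  # rows with no NA in any column
print(paste(n_complete, "complete cases in dataset"))
```

If `n_complete` equals `nrow(wine)`, no record is incomplete.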

Apart from variable X, which seems to be a sequential ID value, I’ll look for explanations for each variable in the dataset information:

Input variables (based on physicochemical tests):

  1. fixed acidity (tartaric acid - g / dm^3): most acids involved with wine are fixed or nonvolatile (do not evaporate readily)

  2. volatile acidity (acetic acid - g / dm^3): the amount of acetic acid in wine, which at too high of levels can lead to an unpleasant, vinegar taste

  3. citric acid (g / dm^3): found in small quantities, citric acid can add ‘freshness’ and flavor to wines

  4. residual sugar (g / dm^3): the amount of sugar remaining after fermentation stops, it’s rare to find wines with less than 1 gram/liter and wines with greater than 45 grams/liter are considered sweet

  5. chlorides (sodium chloride - g / dm^3): the amount of salt in the wine

  6. free sulfur dioxide (mg / dm^3): the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine

  7. total sulfur dioxide (mg / dm^3): amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine

  8. density (g / cm^3): the density of wine is close to that of water depending on the percent alcohol and sugar content

  9. pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale

  10. sulphates (potassium sulphate - g / dm^3): a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant

  11. alcohol (% by volume): the percent alcohol content of the wine

Output variable (based on sensory data):

  1. quality (score between 0 and 10)

In the bar chart above we see the ‘quality’ variable. Its distribution is roughly normal (bell-shaped), bearing in mind that the scores are discrete integers.

All other variables are numerical, so something like a histogram would be useful to visualize each. To avoid depending on bin size, however, I decide to draw a density plot for each variable instead.

In the spirit of DRY, I made a function that will draw a chart containing the following elements:

Some variables have a large number of outliers according to the boxplot stats. Where the rug plot still shows a dense band of values, I decide not to remove the outliers. I do want to eliminate values that are far off, however. For that, I go through the dataset and remove every record in which any variable falls outside its 99% interval. After that I plot all variables again.
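A plotting helper of the kind mentioned above could be sketched in base R as follows. This is an assumption about its contents: I take it to combine a density curve, a rug plot and boxplot statistics, since those are the elements referred to in the text. The name `plot_univariate` and the sample values are made up for illustration.

```r
# Hypothetical helper: density curve + rug + boxplot statistics for one variable.
plot_univariate <- function(x, name) {
  stats <- boxplot.stats(x)                 # whiskers, hinges, median, outliers
  plot(density(x), main = name, xlab = name)
  rug(x)                                    # one tick per observation
  abline(v = stats$stats[2:4], lty = 2)     # lower hinge, median, upper hinge
  invisible(length(stats$out))              # number of points flagged as outliers
}

set.seed(1)
x <- c(rnorm(200, mean = 10, sd = 1), 25)   # made-up values plus one extreme point
pdf(NULL)                                   # draw to a null device (no file written)
n_out <- plot_univariate(x, "alcohol")
invisible(dev.off())
print(n_out)
```

Returning the outlier count makes it easy to see which variables the boxplot rule flags most heavily before deciding how to filter.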

I also add a variable to the set, namely bound sulfur dioxide, which is the difference between total and free sulfur dioxide.

## [1] "4898 records before filtering"
## [1] "4501 records after filtering"
## [1] "8.1pct of data removed"

All plots show that almost all data is still included, so I decide to continue with the filtered dataset. 8% of the data was filtered out.
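The filtering step can be sketched as below. Two things are assumptions on my part: that the "99% interval" means the range between the 0.5th and 99.5th percentile of each variable, and that the bound variable is added before filtering. The helper name `filter_99` and the data are made up; only `wine.filt` and `bound.sulfur.dioxide` are names used in this analysis.

```r
# Sketch: drop every record where any listed variable falls outside the
# central 99% of its own distribution (assumed: 0.5th to 99.5th percentile).
filter_99 <- function(df, cols) {
  keep <- rep(TRUE, nrow(df))
  for (col in cols) {
    bounds <- quantile(df[[col]], probs = c(0.005, 0.995), names = FALSE)
    keep <- keep & df[[col]] >= bounds[1] & df[[col]] <= bounds[2]
  }
  df[keep, , drop = FALSE]
}

set.seed(42)
wine <- data.frame(                      # made-up stand-in data
  free.sulfur.dioxide = rnorm(1000, mean = 35, sd = 15),
  alcohol             = rnorm(1000, mean = 10.5, sd = 1.2)
)
wine$total.sulfur.dioxide <- wine$free.sulfur.dioxide + rnorm(1000, 100, 30)
wine$bound.sulfur.dioxide <- wine$total.sulfur.dioxide - wine$free.sulfur.dioxide

wine.filt <- filter_99(wine, names(wine))
pct_removed <- round(100 * (1 - nrow(wine.filt) / nrow(wine)), 1)
print(paste(nrow(wine), "records before filtering"))
print(paste(nrow(wine.filt), "records after filtering"))
print(paste0(pct_removed, "pct of data removed"))
```

Because a record is dropped as soon as one variable is out of range, the share removed grows with the number of variables, which is why the total ends up well above 1%.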

Univariate Analysis

What is the structure of your dataset?

Each record of the dataset represents a white wine that was analysed and judged. The dataset contains a total of 4898 records.

The variables measured for each wine are:

  • a variable X: sequential unique ID

  • a variable “quality” of type int, which was a subjective score given by wine tasters to each wine in the dataset

  • 11 variables of type num, which were measured objective properties of each wine.

What is/are the main feature(s) of interest in your dataset?

The quality variable. The interesting thing would be to find relationships between objective properties of a wine and its subjective quality.

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

Since quality will be a reflection of taste and smell, the only variable I believe we can rule out as an interesting factor is density. Other than that, I’m not drawing any conclusions yet about which features will be interesting to explore.

Did you create any new variables from existing variables in the dataset?

Yes: bound.sulfur.dioxide = total.sulfur.dioxide - free.sulfur.dioxide

The realization that this might be needed came later in the analysis, when I saw that there was no correlation between quality and free.sulfur.dioxide, but there was one between quality and total.sulfur.dioxide. So I figured the bound part would be what added the relationship, and I doubled back to add it.

Of the features you investigated, were there any unusual distributions?
Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?

Yes. Some values fell well outside their distributions, and since in the end we want to draw general conclusions about what influences the quality of white wine, I decided to filter out outliers for all relevant variables. In other words: any record in which any variable had a value that lay outside that variable’s 99% interval was removed.

Bivariate Plots Section

First I want to get an idea of the correlation between all variables. I do this by creating a heatmap of correlation values, as plotted below. This will hopefully point me in the direction of interesting relationships.

## Warning in cor.smooth(R): Matrix was not positive definite, smoothing was
## done

## Warning in cor.smooth(r): Matrix was not positive definite, smoothing was
## done
## The estimated weights for the factor scores are probably incorrect.  Try a different factor extraction method.
## In factor.scores, the correlation matrix is singular, an approximation is used

##                       [,1]
## density              -0.31
## chlorides            -0.23
## bound.sulfur.dioxide -0.23
## volatile.acidity     -0.18
## total.sulfur.dioxide -0.18
## residual.sugar       -0.10
## fixed.acidity        -0.08
## citric.acid          -0.01
## X                     0.02
## free.sulfur.dioxide   0.02
## sulphates             0.03
## pH                    0.09
## alcohol               0.44
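A sorted list like the one above can be produced with base R alone. The sketch below uses made-up data and a hypothetical helper name, `cor_with_quality`; it is not necessarily how the listing was generated.

```r
# Sketch: Pearson correlation of every variable with quality, sorted ascending.
cor_with_quality <- function(df) {
  predictors <- setdiff(names(df), "quality")
  r <- cor(df[predictors], df$quality)          # one-column matrix
  round(r[order(r[, 1]), , drop = FALSE], 2)
}

set.seed(7)
n <- 500                                        # made-up stand-in data
alcohol     <- rnorm(n, mean = 10.5, sd = 1.2)
density     <- 1.0 - 0.002 * alcohol + rnorm(n, sd = 0.001)
citric.acid <- rnorm(n, mean = 0.33, sd = 0.1)
quality     <- round(2.5 + 0.32 * alcohol + rnorm(n, sd = 0.8))
wine.filt   <- data.frame(alcohol, density, citric.acid, quality)

cors <- cor_with_quality(wine.filt)
print(cors)
```

As a side note, bound.sulfur.dioxide is an exact linear combination of two other columns, so the full correlation matrix is singular. That is a likely source of the `cor.smooth` warnings shown above.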

If we look at the numbers specifically related to the quality variable, we can see the strongest correlation with alcohol (.44), and inversely with density (-.31). However, there appears to exist a strong correlation between density and alcohol as well (-.81). Below all 3 bivariate relations are plotted.

To draw preliminary conclusions: I’d say that instead of stating there is a relationship between quality and density, it is alcohol that is related to quality as well as to density. The intuitive idea that density would do very little to change taste/smell strengthens this conclusion.

Other variables are correlated as well because they are related by definition, e.g. total.sulfur.dioxide, bound.sulfur.dioxide and free.sulfur.dioxide, or the inverse relationship between acidity and pH:

Apart from bound.sulfur.dioxide and total.sulfur.dioxide, the largest correlation of all exists between density and residual sugar:

This is most likely because dissolved sugar is denser than water and therefore drives up the overall density.

And of course something can be said for non-significant relationships as well: citric acid, for example, shows almost no correlation with quality (-0.01).

Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

  • Of all observed features, alcohol has the strongest positive correlation with quality

  • A smaller negative correlation exists between quality and chlorides as well as volatile acidity

  • Regarding sulfur dioxide: there is a smaller negative relationship between bound.sulfur.dioxide and quality, but there is none between free.sulfur.dioxide and quality.

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

  • Density and residual sugar are strongly correlated, probably because sugar is a dense substance

  • Density and alcohol are strongly negatively correlated as well, meaning more alcohol makes for a less dense wine

What was the strongest relationship you found?

The strongest relationship that wasn’t a relationship purely by definition was between density and residual sugar (Pearson’s r of 0.84).

Multivariate Plots Section

I first want to visualize the relationship between density and the two features most strongly correlated with it: residual sugar and alcohol.

First of all, this visualization clearly shows more alcohol results in lower density, and more residual sugar results in higher density.

It also shows that a wine with high levels of residual sugar is less likely to contain a high percentage of alcohol and vice versa. This was already shown from the correlation matrix in the previous section, showing a fairly strong negative relationship between the two.

Next, I want to make a linear model based on the strongest features observed in the previous section.

As an experiment to learn more about how variables interact, I want to take out the variable most strongly correlated with quality, alcohol, by regressing quality on it, and then check the correlations again between the residuals of quality and the other variables. This is also a way to learn more about linear models in general.

Because alcohol is also correlated with other variables, e.g. density as discussed before, I expect correlations to drop for variables related to alcohol level, and to stay equal or rise for variables whose influence is unrelated to alcohol.
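The experiment can be sketched like this: fit quality on alcohol, store the residuals as a new column, and compare the two sets of correlations side by side. The data are made up and `m_alcohol` is an illustrative name; `quality.residual` is the column name used in this analysis.

```r
# Sketch: remove the linear effect of alcohol from quality, then correlate
# the residuals with every variable. Data are a made-up stand-in.
set.seed(11)
n <- 500
alcohol          <- rnorm(n, mean = 10.5, sd = 1.2)
density          <- 1.0 - 0.002 * alcohol + rnorm(n, sd = 0.001)
volatile.acidity <- rnorm(n, mean = 0.28, sd = 0.1)
quality          <- 2.5 + 0.32 * alcohol - 1.9 * volatile.acidity + rnorm(n, sd = 0.75)
wine.filt        <- data.frame(alcohol, density, volatile.acidity, quality)

m_alcohol <- lm(quality ~ alcohol, data = wine.filt)
wine.filt$quality.residual <- resid(m_alcohol)   # part of quality alcohol cannot explain

comparison <- round(cor(wine.filt)[, c("quality", "quality.residual")], 2)
print(comparison)
```

Two properties hold by construction for a least-squares fit with an intercept: the residuals are uncorrelated with alcohol, and the correlation between quality and its residuals equals sqrt(1 - R^2) of the alcohol-only model. With the R-squared of 0.1924 reported for that model further down, this gives sqrt(1 - 0.1924) ≈ 0.90.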

##                      quality quality.residual
## X                       0.02            -0.07
## fixed.acidity          -0.08            -0.05
## volatile.acidity       -0.18            -0.22
## citric.acid            -0.01             0.02
## residual.sugar         -0.10             0.11
## chlorides              -0.23            -0.06
## free.sulfur.dioxide     0.02             0.15
## total.sulfur.dioxide   -0.18             0.02
## density                -0.31             0.04
## pH                      0.09             0.05
## sulphates               0.03             0.04
## alcohol                 0.44             0.00
## quality                 1.00             0.90
## bound.sulfur.dioxide   -0.23            -0.04
## quality.residual        0.90             1.00

As we can see, the correlation with density has been greatly reduced (from -0.31 to 0.04) by taking alcohol out of the equation.

Other correlations have gone down substantially as well. Those that remain or have gone up are:

## `geom_smooth()` using method = 'gam'

The first model I make is based on alcohol only.

## 
## Call:
## lm(formula = quality ~ alcohol, data = wine.filt)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5016 -0.5467 -0.0242  0.4851  3.1350 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2.554557   0.102679   24.88   <2e-16 ***
## alcohol     0.318314   0.009724   32.74   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7789 on 4499 degrees of freedom
## Multiple R-squared:  0.1924, Adjusted R-squared:  0.1922 
## F-statistic:  1072 on 1 and 4499 DF,  p-value: < 2.2e-16

Now we’ll add the influence of residual sugar to the model.

## 
## Call:
## lm(formula = quality ~ alcohol + residual.sugar, data = wine.filt)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4652 -0.5517 -0.0076  0.4590  3.0400 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    1.924129   0.124463  15.459   <2e-16 ***
## alcohol        0.364030   0.010951  33.243   <2e-16 ***
## residual.sugar 0.023581   0.002678   8.807   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7724 on 4498 degrees of freedom
## Multiple R-squared:  0.2061, Adjusted R-squared:  0.2057 
## F-statistic: 583.7 on 2 and 4498 DF,  p-value: < 2.2e-16

Now I add most of the remaining variables to the model (leaving out density, pH and total.sulfur.dioxide), just to see what comes out:

## 
## Call:
## lm(formula = quality ~ alcohol + residual.sugar + fixed.acidity + 
##     volatile.acidity + citric.acid + chlorides + free.sulfur.dioxide + 
##     bound.sulfur.dioxide + sulphates, data = wine.filt)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.2159 -0.5059 -0.0382  0.4464  3.1756 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           2.5636644  0.1986058  12.908  < 2e-16 ***
## alcohol               0.3676800  0.0122572  29.997  < 2e-16 ***
## residual.sugar        0.0234068  0.0027304   8.573  < 2e-16 ***
## fixed.acidity        -0.0470912  0.0149522  -3.149  0.00165 ** 
## volatile.acidity     -1.8642693  0.1281232 -14.551  < 2e-16 ***
## citric.acid          -0.0917222  0.1058558  -0.866  0.38627    
## chlorides            -1.3515910  0.7069441  -1.912  0.05596 .  
## free.sulfur.dioxide   0.0053679  0.0007936   6.764 1.51e-11 ***
## bound.sulfur.dioxide -0.0009329  0.0003990  -2.338  0.01942 *  
## sulphates             0.3226718  0.1059303   3.046  0.00233 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7444 on 4491 degrees of freedom
## Multiple R-squared:  0.2637, Adjusted R-squared:  0.2622 
## F-statistic: 178.7 on 9 and 4491 DF,  p-value: < 2.2e-16

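A peculiar thing is that earlier we saw a stronger correlation between quality and bound sulfur dioxide (-0.23) and practically none between quality and free sulfur dioxide (0.02), whereas in the model this is reversed: free sulfur dioxide is highly significant and bound sulfur dioxide only marginally so. This could be an effect akin to Simpson’s Paradox, where an association changes once other, correlated variables are accounted for.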

I decide to keep only the variables that show the strongest relationship to quality. So, ignoring the latest update, I update m2 again, this time adding only volatile.acidity and free.sulfur.dioxide.

## 
## Call:
## lm(formula = quality ~ alcohol + residual.sugar + volatile.acidity + 
##     free.sulfur.dioxide, data = wine.filt)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.3187 -0.5099 -0.0389  0.4527  3.0757 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          2.0312114  0.1282752  15.835  < 2e-16 ***
## alcohol              0.3874151  0.0106922  36.233  < 2e-16 ***
## residual.sugar       0.0224019  0.0026975   8.305  < 2e-16 ***
## volatile.acidity    -1.9312019  0.1232641 -15.667  < 2e-16 ***
## free.sulfur.dioxide  0.0052818  0.0007766   6.801 1.17e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7467 on 4496 degrees of freedom
## Multiple R-squared:  0.2583, Adjusted R-squared:  0.2577 
## F-statistic: 391.5 on 4 and 4496 DF,  p-value: < 2.2e-16
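The stepwise build-up of the models can be sketched with `lm()` and `update()`. The data are made up, and the names `m1` and `m3` are my own labels for illustration; only `m2` and `wine.filt` are names used in this analysis.

```r
# Sketch: grow a linear model one step at a time with update().
set.seed(3)
n <- 400
wine.filt <- data.frame(                 # made-up stand-in data
  alcohol             = rnorm(n, mean = 10.5, sd = 1.2),
  residual.sugar      = rexp(n, rate = 1 / 6),
  volatile.acidity    = rnorm(n, mean = 0.28, sd = 0.1),
  free.sulfur.dioxide = rnorm(n, mean = 35, sd = 15)
)
wine.filt$quality <- with(wine.filt,
  2 + 0.39 * alcohol + 0.02 * residual.sugar - 1.9 * volatile.acidity +
    0.005 * free.sulfur.dioxide + rnorm(n, sd = 0.75))

m1 <- lm(quality ~ alcohol, data = wine.filt)
m2 <- update(m1, ~ . + residual.sugar)                          # add one term
m3 <- update(m2, ~ . + volatile.acidity + free.sulfur.dioxide)  # add two more

adj_r2 <- sapply(list(m1 = m1, m2 = m2, m3 = m3),
                 function(m) summary(m)$adj.r.squared)
print(round(adj_r2, 4))
```

Comparing adjusted R-squared across the steps shows how much each addition buys. In the summaries above, the four-variable model (0.2577) comes close to the nine-variable one (0.2622) with far fewer terms.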

Now I’d like to check my model by looking at the error term. I do this by generating a fitted quality value, and making a density plot of the difference between fitted value and actual value.

## [1] 0.8666232

I also calculated the mean and SD of the error, and overlaid a normal distribution with these values. As you can see, the error term of our model follows a normal distribution nicely around 0. The SD (0.744), however, is still fairly large, showing almost as much variation as the quality measure on its own (0.867). This invites further exploration of other relationships, perhaps with other variables. It could also suggest that a large part of quality comes down to personal preference, which makes the score noisy because of its subjective nature.
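The residual check can be sketched as follows, on made-up data with a single predictor. The variable names (`err`, `unexplained`) are illustrative.

```r
# Sketch: compare the spread of the model error with the spread of quality.
set.seed(5)
n <- 400
alcohol <- rnorm(n, mean = 10.5, sd = 1.2)      # made-up stand-in data
quality <- 2.5 + 0.32 * alcohol + rnorm(n, sd = 0.78)
m <- lm(quality ~ alcohol)

err      <- fitted(m) - quality                 # fitted minus actual value
err_mean <- mean(err)
err_sd   <- sd(err)

pdf(NULL)                                       # draw to a null device
plot(density(err), main = "Model error")
curve(dnorm(x, mean = err_mean, sd = err_sd), add = TRUE, lty = 2)  # normal overlay
invisible(dev.off())

# Share of the variance in quality that the model leaves unexplained (1 - R^2)
unexplained <- (err_sd / sd(quality))^2
print(round(c(mean = err_mean, sd = err_sd,
              sd.quality = sd(quality), unexplained = unexplained), 3))
```

Applied to the figures in the text, (0.744 / 0.867)^2 ≈ 0.74, so about 26% of the variance in quality is explained. That is consistent with the R-squared values of roughly 0.26 in the model summaries above.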

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

The interesting part here is that the apparent effect of any variable can change entirely depending on whether or not the effect of other variables is controlled for, as in the case of the effect of free and bound sulfur dioxide on quality.

Were there any interesting or surprising interactions between features?

In the beginning of the analysis, I created a bound sulfur dioxide variable because at first there seemed to be a larger correlation between this and quality than between free sulfur dioxide and quality. The reverse seemed true once all effects between supporting features were taken into account by the model.

OPTIONAL: Did you create any models with your dataset? Discuss the
strengths and limitations of your model.

I created a linear model using the features that appeared to have the strongest effect on the variable of interest. It does provide a rough calculation to predict a quality based on these features. However, the residual SD is still fairly large, suggesting there may be many more factors at play that influence the resulting quality score. My suspicion is that the subjectivity of the output variable plays a role in the variability of this score most of all.


Final Plots and Summary

Plot One

Description One

The graph shows the following information:

  • Y (dependent variable): Residual sugar level
  • X (independent variable): Alcohol volume percentage
  • Size and color: Density

Some observations in this graph:

  • In general, when there’s a high alcohol percentage, there’s less likely to be a high amount of residual sugar, and vice versa, when there’s a low alcohol percentage, there’s more likely to be a high amount of residual sugar.
  • The effect of density is easier to see by the size than by the color.
  • Density is the highest with high amounts of residual sugar and low amounts of alcohol. Vice versa, density is the lowest with low residual sugar and high alcohol.

Plot Two

## `geom_smooth()` using method = 'gam'

Description Two

The graph shows the following information:

  • Y (dependent variable): Quality score
  • X (independent variable): Free Sulfur Dioxide
  • Color: Alcohol volume percentage

Some observations in this graph:

  • Quality seems to drop below average for very low values of free sulfur dioxide. Then comes a maximum quality at about 35 mg / dm^3 after which quality drops off a bit again.
  • Alcohol still shows a nice trend, going up together with quality.

Plot Three

Description Three

This graph shows 3 relationships, each between 2 of the following 3:

  • Quality
  • Alcohol
  • Density

Some observations in this graph:

  • It simply serves to show that all three are correlated with each other. Density and Alcohol are highly correlated, and Quality is correlated with both Alcohol and Density, either directly or indirectly.

Reflection

The big discovery for me is that it’s hard to know when you’re done, because there’s no one right way to go through a dataset. With every added variable, the number of relationships in the data that can be explored grows exponentially. Starting with a clear question helps with narrowing the investigation.

The biggest struggle was how to interpret the correlations and the model. It’s not simply a matter of pushing a dataset through some formulas. Some domain knowledge is required, as is careful study of what is related to what.

Further investigation is possible. After fitting the model, there was still a fairly large variation in the residuals. One reason for this might be that all these variables describe properties of the wine, but none describe properties of the people doing the tasting. The quality of a wine very much depends on who does the judging, which is the side of the equation that is not explored here.